Dataset used:
1. National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave I, 1994-1995 and
https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11900
2. National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave IV, 2008
https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11920
The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a school-based longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. Data have been collected from adolescents, their fellow students, school administrators, parents, siblings, friends, and romantic partners through multiple data collection components, including four respondent in-home interviews. In addition, existing databases with information about respondents’ neighborhoods and communities have been merged with Add Health data, including variables on income and poverty, unemployment, availability and utilization of health services, crime, church membership, and social programs and policies.
The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32*. Add Health combines longitudinal survey data on respondents’ social, economic, psychological and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships, providing unique opportunities to study how social environments and behaviors in adolescence are linked to health and achievement outcomes in young adulthood. The fourth wave of interviews expanded the collection of biological data in Add Health to understand the social, behavioral, and biological linkages in health trajectories as the Add Health cohort ages through adulthood.
This study spans 14 years, from 1994 to the most recent wave in 2008.
Wave I
The public use dataset for Wave I contains information collected in 1994-95 from Add Health’s nationally representative sample of adolescents. This dataset includes Wave I respondents and consists of one-half of the core sample, chosen at random, and one-half of the oversample of African-American adolescents with a parent who has a college degree. The total number of Wave I respondents in this dataset is approximately 6,500.
The Wave I public use dataset includes information from each of the following sources (as available):
1. In-School Questionnaire
2. Wave I In-Home Interview
3. Add Health Picture Vocabulary Test (AHPVT), an abbreviated version of the Peabody Picture Vocabulary Test—Revised, with age-standardized scores for adolescent respondents
4. Wave I Parent Questionnaire
5. Contextual data
6. In-school network data
7. Weights
Wave IV
Wave IV was designed to study developmental and health trajectories across the life course from adolescence into young adulthood. Taking place in 2008, approximately 92.5% of the original Wave I respondents were located and 80.3% of eligible cases were interviewed. The Wave IV public use file contains data on 5,114 respondents, aged 24 to 32*. In Wave IV, biological data were also gathered to acquire a greater understanding of predisease pathways, with a specific focus on obesity, stress, and health risk behavior.
The Wave IV public use dataset includes the following data files:
1. Wave IV In-Home Interview File: variables from the in-home interview, including anthropometric measures
2. Relationship Data
3. Pregnancy Table File
4. Live Births File
5. Children and Parenting File
6. Wave IV Weights
7. Wave IV Public Use Biomarkers, Glucose Data
8. Wave IV Public Use Biomarkers, Measures of EBV and hsCRP
9. Wave IV Public Use Biomarkers, Lipids Data
Description of data quality
The dataset combines longitudinal survey data on respondents’ social, economic, psychological and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships. The dataset has been followed into young adulthood with four in-home interviews, the most recent in 2008.
The quality of the data is high, and Add Health is considered one of the most comprehensive datasets on adolescent health and development. The dataset is also publicly available for researchers to use. However, there are some known sources of error or bias. The sample is not representative of all adolescents in the United States: it excludes those who dropped out of school before grade 7 or who were not enrolled in school during the 1994-95 school year. Add Health also oversampled schools with larger proportions of black and Hispanic students, and self-reported data may introduce measurement error.
The dataset is maintained at the Odum Institute Data Archive.
The Odum Institute Data Archive is a research data stewardship organization that provides long-term preservation and stewardship of research data assets to broaden scientific inquiry, promote research reproducibility, and foster data fluency. The archive is home to one of the largest catalogs of social science research data in the U.S., including the Harris Polls, North Carolina Vital Statistics, and the most complete collection of 1970s U.S. Census data. The institute offers services for data management plan development and implementation, finding & accessing data, data management training & education, and data curation for reproducibility training. UNC Dataverse is a web-based data repository that enables scientists, research teams, scholarly journals, and other members of the UNC research community to archive and share their own datasets.
There are several data mining techniques that are used to extract useful information from large datasets. Some of the most popular ones include:
Association analysis: This technique is used to find association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for market basket or transaction data analysis.
Classification: This technique is used to classify data into predefined classes or categories based on their attributes.
Prediction: This technique is used to predict future trends or values based on historical data.
Clustering: This technique is used to group similar data points together based on their attributes.
Regression: This technique is used to establish a relationship between two or more variables in a dataset.
Artificial Neural Network (ANN) Classifier Method: This technique is used to classify data into predefined classes or categories based on their attributes using an artificial neural network.
Outlier Detection: This technique is used to identify unusual patterns or observations in a dataset.
Genetic Algorithm: This technique is used to optimize complex problems by simulating the process of natural selection.
References of data mining techniques:
https://www.geeksforgeeks.org/data-mining-techniques/
https://www.investopedia.com/terms/d/datamining.asp
https://www.ibm.com/topics/data-mining
https://www.javatpoint.com/data-processing-in-data-mining
https://www.springboard.com/blog/data-science/data-mining/
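As a quick illustration of two of the techniques above (classification and clustering), here is a minimal scikit-learn sketch. The synthetic data and every name in it are illustrative only, not drawn from the Add Health study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Synthetic data standing in for survey features
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification: assign records to predefined classes
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Clustering: group similar records without using labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```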
How do various features related to educational resources, health resources, and other opportunities influence participants' highest education level?
In this study, the following methods were proposed:
Regression analysis is proposed to understand patterns in adolescent-to-adult health. Logistic regression is proposed to understand relationships between a binary target variable and various features.
Classification is proposed to model yes/no (Y/N) patterns in participants' education level.
Prediction is proposed to check whether the classification and regression models can predict participants' education level.
Feature Selection: Feature selection techniques help in choosing the most relevant and informative features for building models, reducing complexity and improving model performance.
Ensemble Methods: Ensemble methods combine multiple models to improve prediction accuracy and reduce overfitting. Bagging (e.g., Random Forest) and Boosting (e.g., Gradient Boosting Machines) are common ensemble techniques.
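A minimal sketch of the proposed logistic regression step for a binary target: fit the model and read the exponentiated coefficients as odds ratios. The synthetic data and variable names here are illustrative, not the study's actual features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: features vs. a binary education outcome
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=7)

# Logistic regression links each feature to the log-odds of the target
logit = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = odds ratio: >1 raises the odds of the outcome, <1 lowers them
odds_ratios = np.exp(logit.coef_[0])
for i, orat in enumerate(odds_ratios):
    print(f"feature {i}: odds ratio {orat:.2f}")
```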
#import required packages
import numpy as np
import pandas as pd
df_1994 = pd.read_sas('F:/a_Harrisburg_University_Academics/CISC 520-50-O-2023Summer- Big Data Mining/Term Project/01 Dataset selection/Datasets/dataverse_files_wave1-1994/w1inhome.sas7bdat')
df_1994
df_1994.shape
df_1994.describe()
df_2008 = pd.read_sas('F:/a_Harrisburg_University_Academics/CISC 520-50-O-2023Summer- Big Data Mining/Term Project/01 Dataset selection/Datasets/dataverse_files/w4inhome.sas7bdat')
df_2008 = pd.DataFrame(data=df_2008)
df_2008
df_2008.shape
df_2008.describe()
The final dataset will be a combination of the Wave I and Wave IV datasets. The aim is to understand factors leading to adolescent-to-adult success: how behavior patterns, traits, environment, parenting, etc. affect success and a healthy life.
#!pip install cufflinks plotly
#!pip install chart-studio
#!pip install plotly jupyterlab --user
# importing some plotly modules
import chart_studio
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# importing cufflinks and setting it up to use plotly offline
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=False) # connected = True for using online mode.
import plotly.express as px
# conventional plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
## reference : plotly errors : https://stackoverflow.com/questions/52771328/plotly-chart-not-showing-in-jupyter-notebook
#display all column names of DataFrame df_1994 for reference
# print(df_1994.columns.tolist())
Since the dataset is very large and computation time is long, we plot a limited number of columns.
Here df_corr is a correlation matrix for the first n columns.
import copy
df_1994_copy = copy.deepcopy(df_1994)
# df_corr = df_1994_copy.iloc[:, :30].corr() # Generate correlation matrix of first 30 columns ..
# # fig = go.Figure()
# # layout = {"title": "Add Title in plotly"}
# # fig.add_trace(
# # go.Heatmap(
# # x = df_corr.columns,
# # y = df_corr.index,
# # z = np.array(df_corr)
# # )
# # )
# fig = go.Figure(
# data=go.Heatmap(
# x = df_corr.columns,
# y = df_corr.index,
# z = np.array(df_corr)
# ),
# layout=go.Layout(
# title="Correlation matrix of n columns",
# xaxis=dict(title='corr matrix columns'),
# yaxis=dict(title='corr matrix indexes')
# ),
# )
# fig.show()
# ## reference :
# ## https://en.ai-research-collection.com/plotly-heatmap/
# ## https://stackoverflow.com/questions/59059378/title-for-colorbar-in-plotly-heatmap
# df_1994['H1DA1'].iplot(kind='hist', xTitle='work', yTitle='count', title='how many times did the participant do work around the house')
H1GH60 - what is your weight?
range 50 to 360 pounds
996 = refused
998 = don't know
999 = not applicable
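An alternative to filtering out the special codes (as done below) is to recode them to `NaN`, so rows survive for other variables. A minimal sketch with a made-up weight column using the same Add Health-style codes:

```python
import numpy as np
import pandas as pd

# Hypothetical weight column with Add Health-style special codes
s = pd.Series([120.0, 996.0, 180.0, 998.0, 999.0, 210.0])

# Recode refused (996), don't know (998), not applicable (999) to missing
cleaned = s.replace([996.0, 998.0, 999.0], np.nan)
print(cleaned.dropna().tolist())  # only the valid weights remain
```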
## distribution of participant weight (H1GH60), including special codes
%matplotlib inline
df_1994["H1GH60"].plot(kind="hist")
df_1994_weight = df_1994[(df_1994['H1GH60']>=0) & (df_1994['H1GH60']<=900)] ## removing 996, 998, 999
#df_1994_weight["H1GH60"].plot(kind="hist")
df_1994_weight['H1GH60'].iplot(kind='hist', xTitle='Weight pounds', yTitle='count', title='Participants weight in pounds')
H1DA8 is a quantitative variable, but 996 and 998 are coded as refused and don't know. Removing 996 and 998.
H1DA8 : How many hours a week do you watch television?
0 = does not watch TV
1 to 99 = range 1 to 99 hours
996 = refused
998 = don't know
df_1994_TV = df_1994[(df_1994['H1DA8']>=0) & (df_1994['H1DA8']<=900)]
df_1994_TV['H1DA8'].iplot(kind='hist', xTitle='number of hrs watched TV', yTitle='count', title='How many hours did the participant watch TV?')
#!pip install --upgrade seaborn
Variables under investigation
H1ED12 : what was your grade in mathematics?
1 = A
2 = B
3 = C
4 = D or lower
5 = didn't take math
6 = different grading pattern
96 = refused
97 = skipped
98 = don't know
H1DA8 : How many hours a week do you watch television?
0 = does not watch TV
1 to 99 = range 1 to 99 hours
996 = refused
998 = don't know
BIO_SEX: Interviewer, please confirm that R’s sex is (male) female
1 = male
2 = female
6 = refused
sns.barplot(x="H1ED12", y="H1DA8", data=df_1994, ci=None)
plt.xlabel('Grade: 1 = A, 2 = B, 3 = C, 4 = D or lower, 5 = didnt take math, 6 = different grading pattern, 96 = refused, 97 = skipped, 98 = dont know ')
plt.ylabel('TV watching hours hrs/week')
Converting grades of B or lower into one category and grade A into another for analysis
df_1994_copy = df_1994_copy[(df_1994_copy['H1DA8']>=0) & (df_1994_copy['H1DA8']<=99)]
df_1994_copy.loc[df_1994_copy['H1ED12'] != 1 , 'H1ED12'] = 0
sns.barplot(x="H1ED12", y="H1DA8", data=df_1994_copy, errorbar=('ci', 95))
plt.xlabel('Grade:\n 0 = B or less \n 1 = A')
plt.ylabel('TV watching hours hrs/week')
fig3 = px.box(
data_frame = df_1994_copy
,y = 'H1DA8'
,x = 'H1ED12'
,color = 'BIO_SEX'
,labels={
"H1DA8": "TV watching hours hrs/week",
"H1ED12": "Grade: 1 = A; 0 = B or less",
"BIO_SEX": "Gender <br> 1 = male <br> 2 = female"}
)
fig3.show()
#df_1994_copy.iplot(x='H1ED12', y='H1DA8', categories='BIO_SEX', xTitle='Grade:\n 0 = B or less \n 1 = A', yTitle='TV watching hours hrs/week', title='Time spend watching TV vs Grades')
Students with an A grade in math tend to spend fewer hours watching TV
df_merged = pd.merge(df_1994, df_2008, on="AID", how="left")
df_merged.shape
2794+920-1 ## merged columns = df_1994 columns + df_2008 columns - 1 shared key (AID); rows = df_1994 rows
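The column arithmetic follows from the left merge: both frames contribute all their columns, but the shared key `AID` appears only once, and all rows of the left frame are kept. A toy check (the frames here are made up, not the study data):

```python
import pandas as pd

left = pd.DataFrame({"AID": [1, 2, 3], "a": [10, 20, 30], "b": [1, 2, 3]})
right = pd.DataFrame({"AID": [1, 2], "c": [100, 200]})

merged = pd.merge(left, right, on="AID", how="left")
# columns: 3 + 2 - 1 (shared key counted once); rows: all rows of `left`
print(merged.shape)  # (3, 4)
```

Rows of `left` with no match in `right` (here `AID == 3`) get `NaN` in the right-hand columns, which is why missing-value handling follows the merge in the main analysis.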
To compute body mass index (BMI), the following features were investigated:
df_bmi_94 = df_merged[(df_merged['H1GH60']>=0) & (df_merged['H1GH60']<=900)][['AID', 'H1GH60','H1GH59A','H1GH59B']]
df_bmi_94 = df_bmi_94[(df_bmi_94['H1GH59B']>=0) & (df_bmi_94['H1GH59B']<=90)]
df_bmi_94 = df_bmi_94[(df_bmi_94['H1GH59A']>=0) & (df_bmi_94['H1GH59A']<=90)]
df_bmi_94['HEIGHT94'] = df_bmi_94['H1GH59A']*12 + df_bmi_94['H1GH59B']
df_bmi_94['BMI94'] = 703*df_bmi_94['H1GH60'] / (df_bmi_94['HEIGHT94'])**2
df_bmi_94 = df_bmi_94[['AID', 'H1GH60','HEIGHT94','BMI94']]
df_bmi_94['H1GH60'] = df_bmi_94['H1GH60']*0.45359237 # convert weight to kg
df_bmi_94['HEIGHT94'] = df_bmi_94['HEIGHT94']*2.54 # convert height to cm
df_bmi_94
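The imperial BMI formula used above (703 × weight[lb] / height[in]²) is the standard approximation of the metric definition (weight[kg] / height[m]²). A small sketch checking the two forms agree; the function names and sample values are illustrative:

```python
def bmi_imperial(weight_lb: float, height_in: float) -> float:
    """BMI from pounds and inches, as used for the Wave I data."""
    return 703 * weight_lb / height_in ** 2

def bmi_metric(weight_kg: float, height_m: float) -> float:
    """BMI from kilograms and metres, as used for the Wave IV data."""
    return weight_kg / height_m ** 2

# The two forms agree after unit conversion (0.45359237 kg/lb, 0.0254 m/in)
lb, inch = 150.0, 65.0
print(round(bmi_imperial(lb, inch), 2))                          # 24.96
print(round(bmi_metric(lb * 0.45359237, inch * 0.0254), 2))      # 24.96
```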
df_bmi_08 = df_merged[(df_merged['H4WGT']>=0) & (df_merged['H4WGT']<=880)][['AID', 'H4WGT','H4HGT']]
df_bmi_08 = df_bmi_08[(df_bmi_08['H4HGT']>=0) & (df_bmi_08['H4HGT']<=900)]
df_bmi_08['BMI08'] = df_bmi_08['H4WGT'] / (df_bmi_08['H4HGT']/100)**2
df_bmi_08
df_bmi = pd.merge(df_bmi_94, df_bmi_08, on="AID", how="left").rename(
columns={'H4WGT':'WEIGHT_08','H4HGT':'HEIGHT_08','BMI08':'BMI_08',
'H1GH60':'WEIGHT_94','HEIGHT94':'HEIGHT_94','BMI94':'BMI_94'})
# Data Visualization
num_feats = df_bmi.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
# BIO_SEX
df_sex = df_merged[['AID','BIO_SEX']]
df_sex
df_bmi_sex = pd.merge(df_bmi, df_sex, on="AID", how="left")
# Box plots of numerical features split by sex (BIO_SEX)
numerical_features = df_bmi_sex.select_dtypes(include=[np.number])
# Number of rows and columns in the numerical features DataFrame
num_rows, num_cols = numerical_features.shape
# Number of subplots required (excluding the last column, BIO_SEX)
num_plots = num_cols - 1
# Setting the size of the box plot grid
plt.figure(figsize=(12, 8))
# Loop through each numerical feature and create a box plot by sex
for i, feature in enumerate(numerical_features.columns[:-1]):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=df_bmi_sex, x='BIO_SEX', y=feature, palette='coolwarm')
    plt.title(f"{feature} vs sex")
    plt.xlabel('sex 1: male, 2: female')
    plt.ylabel(feature)
plt.tight_layout()
plt.show()
# Delta BMI
df_deltaBMI = df_bmi_sex.copy()  # copy to avoid mutating df_bmi_sex
df_deltaBMI['deltaBMI'] = df_bmi_sex['BMI_08']-df_bmi_sex['BMI_94']
df_deltaBMI = df_deltaBMI[['AID','BIO_SEX','BMI_08','deltaBMI']]
df_deltaBMI
num_feats = df_deltaBMI.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
## Some participants show a reduction in BMI from 1994 to 2008
df_merged = pd.merge(df_merged,df_deltaBMI[['AID','BMI_08','deltaBMI']], on="AID", how="left") #df_deltaBMI[['AID','deltaBMI']]
df_merged.shape
df_merged['BMI_08']
Selecting the following features from the datasets.
| Feature | Description | Type |
|---|---|---|
| AID | Participant unique identifier | Object |
| BIO_SEX | Gender 1 R is male, 2 R is female, 6 refused | Categorical |
| H4OD1Y | Respondent's date of birth – year 1974 - 1983 | Numerical |
| H1DA6 | During the past week, how many times did you do exercise: 0 : not at all, 1 : 1 or 2 times, 2 : 3 or 4 times, 3 : 5 or more times, 6 : refused,8 : don’t know | Categorical |
| H4DA1 | how many hours did you watch television? 1-150 hours, 996: refused, 998: don't know | Numerical |
| H4SP1H | what time do you usually wake up? 1-12 hours, 96: refused, 98: don't know | Numerical |
| H4PE7 | I'm always optimistic about my future- 1: strongly agree ,2: agree ,3: neither agree nor disagree ,4: disagree ,5: strongly disagree ,6: refused ,8: don't know , .: missing | Categorical |
| H4GH1 | how is your health?-1: excellent ,2: very good ,3: good ,4: fair ,5: poor | Categorical |
| H1ED14 | Grade in science? - 1: A ,2: B ,3: C ,4: D or lower ,5: didn’t take this subject ,6: took the subject, but it wasn’t graded this way ,96: refused ,97: legitimate skip ,98: don’t know | Categorical |
| H1ED13 | Grade in history or social studies? 1: A ,2: B ,3: C ,4: D or lower ,5: didn’t take this subject ,6: took the subject, but it wasn’t graded this way ,96: refused ,97: legitimate skip ,98: don’t know | Categorical |
| H1ED12 | Grade in mathematics? 1: A ,2: B ,3: C ,4: D or lower ,5: didn’t take this subject ,6: took the subject, but it wasn’t graded this way ,96: refused ,97: legitimate skip ,98: don’t know | Categorical |
| H1GH51 | How many hours of sleep do you usually get? 1-20 hours ,96: refused ,98: don’t know | Numerical |
| H1DA8 | How many hours a week do you watch television? 0 hrs, 1-99 hrs, 996 refused, 998 don't know | Numerical |
| H1GH1 | how is your health? 1: excellent ,2: very good ,3: good ,4: fair ,5: poor, 6:refused, 8: don't know | Categorical |
| H1DA10 | How many hours a week do you play video or computer games? 0: don't play, 1 - 99 hrs, 996: refused, 998 don't know | Numerical |
| BMI_08 | Bmi in 2008 calculated in above steps | Numerical |
| deltaBMI | Change Bmi from 1994 to 2008 as calculated in above steps | Numerical |
| H4ED2 | (TARGET) highest level of education: 1: 8th grade or less ,2: some high school ,3: high school graduate ,4: some vocational/technical training (after high school) ,5: completed vocational/technical training (after high school) ,6: some college ,7: completed college (bachelor's degree) ,8: some graduate school ,9: completed a master's degree ,10: some graduate training beyond a master's degree ,11: completed a doctoral degree ,12: some post baccalaureate professional education (e.g., law school) ,13: completed post baccalaureate professional education (e.g., law school, med school, nurse) ,98: don't know | Categorical |
These features were selected based on related work in peer-reviewed references.
df_study = df_merged[(df_merged['BIO_SEX']>=0) & (df_merged['BIO_SEX']<=5)][['AID','H4OD1Y','BIO_SEX','H1DA6','H4DA1','H4SP1H', 'H4ED2','H4PE7','H4GH1','H1ED14','H1ED13','H1ED12','BMI_08','H1GH51','H1DA8','H1GH1','H1DA10','deltaBMI']]
df_study['BIO_SEX'] = df_study['BIO_SEX'].astype('int')
# H1DA8 (Wave 1) number of hours spent in watching television per week
df_study = df_study[(df_study['H1DA8']>=0) & (df_study['H1DA8']<995)] # remove skip and dont know
# H1DA6 (Wave 1) past week, how many times did you do exercise? recode to yes / no
df_study = df_study[(df_study['H1DA6']>=0) & (df_study['H1DA6']<6)] # remove skip and dont know
df_study.loc[df_study['H1DA6'] <1 , 'H1DA6'] = 0 # did not exercise
df_study.loc[df_study['H1DA6'] >=1 , 'H1DA6'] = 1 # exercised
df_study['H1DA6'] = df_study['H1DA6'].astype('int')
# H1DA10 (Wave 1) How many hours a week do you play video or computer games? recode to yes / no
df_study = df_study[(df_study['H1DA10']>=0) & (df_study['H1DA10']<995)] # remove skip and dont know
df_study.loc[df_study['H1DA10'] <=1 , 'H1DA10'] = 0 # did not play pc games
df_study.loc[df_study['H1DA10'] >1 , 'H1DA10'] = 1 # played
df_study['H1DA10'] = df_study['H1DA10'].astype('int')
# H1GH1 (Wave 1) In general, how is your health?
df_study = df_study[(df_study['H1GH1']>=0) & (df_study['H1GH1']<6)] # recode health status, remove skip and dont know
df_study.loc[df_study['H1GH1'] <=3 , 'H1GH1'] = 1 # good
df_study.loc[df_study['H1GH1'] >3 , 'H1GH1'] = 0 # bad
df_study['H1GH1'] = df_study['H1GH1'].astype('int')
# Studies and grades
# H1ED12 (Wave 1) what was your grade in mathematics?
df_study = df_study[(df_study['H1ED12']>=0) & (df_study['H1ED12']<5)] # recode grade status, remove skip and dont know
df_study.loc[df_study['H1ED12'] <=1 , 'H1ED12'] = 1 # A
df_study.loc[df_study['H1ED12'] >1 , 'H1ED12'] = 0 # B or less
df_study['H1ED12'] = df_study['H1ED12'].astype('int')
# H1ED14 (wave 1) what was your grade in science?
df_study = df_study[(df_study['H1ED14']>=0) & (df_study['H1ED14']<5)] # recode grade status, remove skip and dont know
df_study.loc[df_study['H1ED14'] <=1 , 'H1ED14'] = 1 # A
df_study.loc[df_study['H1ED14'] >1 , 'H1ED14'] = 0 # B or less
df_study['H1ED14'] = df_study['H1ED14'].astype('int')
# H1ED13 (wave 1) what was your grade in history or social studies?
df_study = df_study[(df_study['H1ED13']>=0) & (df_study['H1ED13']<5)] # recode grade status, remove skip and dont know
df_study.loc[df_study['H1ED13'] <=1 , 'H1ED13'] = 1 # A
df_study.loc[df_study['H1ED13'] >1 , 'H1ED13'] = 0 # B or less
df_study['H1ED13'] = df_study['H1ED13'].astype('int')
#Health
# H4OD1Y (Wave 4) Birth year
df_study = df_study[(df_study['H4OD1Y']>=1974) & (df_study['H4OD1Y']<=1983)] # keep valid birth years (1974-1983), remove refused and don't know
df_study['H4OD1Y'] = 2008 - df_study['H4OD1Y'] # convert birth year to age at Wave IV
# H1GH51 How many hours of sleep do you usually get?
df_study = df_study[(df_study['H1GH51']>=0) & (df_study['H1GH51']<86)] # keep valid sleep hours, remove refused and don't know
# H4SP1H (Wave 4) On the days you go to work, school or similar activities, what time do you usually wake up? [Hour]
df_study = df_study[(df_study['H4SP1H']>=0) & (df_study['H4SP1H']<86)] # keep valid wake-up hours, remove refused and don't know
# H4GH1 (wave 4) In general, how is your health?
df_study = df_study[(df_study['H4GH1']>=0) & (df_study['H4GH1']<6)] # recode health status, remove refused and don't know
df_study.loc[df_study['H4GH1'] <=3 , 'H4GH1'] = 1 # great/good
df_study.loc[df_study['H4GH1'] >3 , 'H4GH1'] = 0 # poor
df_study['H4GH1'] = df_study['H4GH1'].astype('int')
# H4PE7 (wave 4) I'm always optimistic about my future
df_study = df_study[(df_study['H4PE7']>=0) & (df_study['H4PE7']<6)] # recode optimism, remove refused and don't know
df_study.loc[df_study['H4PE7'] <3 , 'H4PE7'] = 1 # agree
df_study.loc[df_study['H4PE7'] >=3 , 'H4PE7'] = 0 # disagrees or neither
df_study['H4PE7'] = df_study['H4PE7'].astype('int')
# H4DA1 In the past seven days, how many hours did you watch television?
df_study = df_study[(df_study['H4DA1']>=0) & (df_study['H4DA1']<995)] # keep valid TV hours, remove refused and don't know
#target variable > highest level of education
# H4ED2 (Wave 4) highest level of education recode education status <target categorical
df_study = df_study[(df_study['H4ED2']>=0) & (df_study['H4ED2']<97)] # recode education status, remove skip and dont know
df_study.loc[df_study['H4ED2'] <=6 , 'H4ED2'] = 0 # some college or less
df_study.loc[df_study['H4ED2'] >6 , 'H4ED2'] = 1 # bachelor's degree or more
df_study['H4ED2'] = df_study['H4ED2'].astype('int')
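The repeated filter-and-binarize steps above all follow one pattern and could be factored into small helpers. A sketch; the column names, code ranges, and cutoffs here are examples only, so each variable's codebook should be checked before reuse:

```python
import pandas as pd

def drop_codes(df: pd.DataFrame, col: str, lo: float, hi: float) -> pd.DataFrame:
    """Keep rows whose value in `col` lies in [lo, hi], dropping refused/skip/don't-know codes."""
    return df[(df[col] >= lo) & (df[col] <= hi)]

def binarize(df: pd.DataFrame, col: str, cutoff: float) -> pd.DataFrame:
    """Recode `col` to 1 if value <= cutoff else 0 (e.g. grade A vs. B or less)."""
    df = df.copy()
    df[col] = (df[col] <= cutoff).astype(int)
    return df

# Example with made-up grade data (1=A ... 4=D or lower, 96/98 special codes)
toy = pd.DataFrame({"H1ED12": [1, 2, 4, 96, 98, 1]})
toy = binarize(drop_codes(toy, "H1ED12", 1, 4), "H1ED12", 1)
print(toy["H1ED12"].tolist())  # [1, 0, 0, 1]
```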
num_feats = df_study.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(5, 4, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
# df_study.dtypes
df_study.shape
df_study.isna().sum()
data = df_study.copy()
corr = data.corr(numeric_only=True) # AID is a string identifier, exclude it
print(corr)
sns.heatmap(corr.round(2) , annot=True)
df_study.dropna(inplace=True)
df_study.isna().sum()
from sklearn.preprocessing import scale
# Normalize the Data
X = df_study.drop(['AID', 'H4ED2'], axis=1) # drop the identifier and the target
x = scale(X)
inputs = pd.DataFrame(x)
print(inputs.head())
inputs.columns = X.columns
target = pd.DataFrame(df_study[['H4ED2']])
target.reset_index(drop=True, inplace=True) # reset index of target dataframe
data = pd.concat([inputs, target], axis=1).reindex(inputs.index)
print(data.head())
# Box plots of numerical features split by the target variable H4ED2
numerical_features = data.select_dtypes(include=[np.number])
# Number of rows and columns in the numerical features DataFrame
num_rows, num_cols = numerical_features.shape
# Number of subplots required (excluding the last column, the target H4ED2)
num_plots = num_cols - 1
# Setting the size of the box plot grid
plt.figure(figsize=(12, 8))
# Loop through each numerical feature and create a box plot split by H4ED2
for i, feature in enumerate(numerical_features.columns[:-1]):
    plt.subplot(4, 4, i + 1)
    sns.boxplot(data=data, x='H4ED2', y=feature, palette='coolwarm')
    plt.title(f"{feature} vs H4ED2")
    plt.xlabel("H4ED2 0: some college or less, 1: bachelor's degree or more")
    plt.ylabel(feature)
plt.tight_layout()
plt.show()
# Inter quartile range
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
## Number of outliers
((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).sum()
## remove outliers from continuous features
Q1 = data[['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']].quantile(0.25)
Q3 = data[['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']].quantile(0.75)
IQR = Q3 - Q1
# Create masks for each column separately
mask_bmi = (data['BMI_08'] < (Q1['BMI_08'] - 1.5 * IQR['BMI_08'])) | (data['BMI_08'] > (Q3['BMI_08'] + 1.5 * IQR['BMI_08']))
mask_h1gh51 = (data['H1GH51'] < (Q1['H1GH51'] - 1.5 * IQR['H1GH51'])) | (data['H1GH51'] > (Q3['H1GH51'] + 1.5 * IQR['H1GH51']))
mask_h1da8 = (data['H1DA8'] < (Q1['H1DA8'] - 1.5 * IQR['H1DA8'])) | (data['H1DA8'] > (Q3['H1DA8'] + 1.5 * IQR['H1DA8']))
mask_deltabmi = (data['deltaBMI'] < (Q1['deltaBMI'] - 1.5 * IQR['deltaBMI'])) | (data['deltaBMI'] > (Q3['deltaBMI'] + 1.5 * IQR['deltaBMI']))
# Combine masks using logical OR (|) operator
mask_combined = mask_bmi | mask_h1gh51 | mask_h1da8 | mask_deltabmi
# # Replace outliers with NaN
# data[['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']][mask_combined] = np.nan
# Replace outliers with NaN using .loc
data.loc[mask_combined, ['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']] = np.nan
print(Q1,Q3,IQR)
data.isna().sum()
## remove outliers
data.dropna(inplace=True)
data.isna().sum()
print("-------Total Data Count------- \n\n",data.count())
print("\n\n\n-------Total Outlier Count (masked rows)------- \n\n",mask_combined.sum())
print("\n\n\n-------Missing Values Remaining After Dropping------- \n\n",data.isna().sum())
# Box plots of numerical features split by the target variable H4ED2, after outlier removal
numerical_features = data.select_dtypes(include=[np.number])
# Number of rows and columns in the numerical features DataFrame
num_rows, num_cols = numerical_features.shape
# Number of subplots required (excluding the last column, the target H4ED2)
num_plots = num_cols - 1
# Setting the size of the box plot grid
plt.figure(figsize=(12, 8))
# Loop through each numerical feature and create a box plot split by H4ED2
for i, feature in enumerate(numerical_features.columns[:-1]):
    plt.subplot(4, 4, i + 1)
    sns.boxplot(data=data, x='H4ED2', y=feature, palette='coolwarm')
    plt.title(f"{feature} vs H4ED2")
    plt.xlabel("H4ED2 0: some college or less, 1: bachelor's degree or more")
    plt.ylabel(feature)
plt.tight_layout()
plt.show()
data = data.copy()
corr = data.corr()
print(corr)
sns.heatmap(corr.round(2) , annot=True)
df_depr=data.copy() ## data is scaled
# df_depr=df_study.copy() ## data is unscaled
df_depr.isna().sum()
df_depr.dropna(inplace=True)
df_depr.isna().sum()
df_depr.describe()
num_feats = df_depr.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(5, 4, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
Carrying out feature selection to find the top 3 features
# Train each of the Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
The dataset target is distributed unevenly.
df_depr.shape
print("Number of participants with education level college or lower (H4ED2 == 0):", (df_depr['H4ED2']==0).sum())
print("Number of participants with education level university or higher (H4ED2 == 1):", (df_depr['H4ED2']==1).sum())
# https://elitedatascience.com/imbalanced-classes
from sklearn.utils import resample
# Separate majority and minority classes
Depress_no = df_depr[df_depr['H4ED2']==0]
Depress_yes = df_depr[df_depr['H4ED2']==1]
rand_state = 123
# Downsample the majority class and upsample the minority class to 1500 rows each
Depress_no_upsampled = resample(Depress_no,
                                replace=False,   # sample without replacement (downsample)
                                n_samples=1500,  # target class size
                                random_state=rand_state)  # reproducible results
Depress_yes_upsampled = resample(Depress_yes,
                                 replace=False,  # keep each minority row once
                                 n_samples=(df_depr['H4ED2']==1).sum(),
                                 random_state=rand_state)
Depress_yes_upsampled_repeat = resample(Depress_yes,
                                        replace=True,  # sample with replacement to reach 1500
                                        n_samples=1500 - (df_depr['H4ED2']==1).sum(),
                                        random_state=rand_state)
# Combine majority class with upsampled minority class
df_depr_upsampled = pd.concat([Depress_no_upsampled, Depress_yes_upsampled, Depress_yes_upsampled_repeat])
# Display new class counts
df_depr_upsampled['H4ED2'].value_counts()
data = df_depr_upsampled.copy()
corr = data.corr()
# print(corr)
sns.heatmap(corr.round(2) , annot=True)
num_feats = df_depr_upsampled.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(5, 4, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
df_depr_upsampled.isna().sum()
# print(df_depr_upsampled.dtypes)
# df_depr_upsampled['H4ED2'] = df_depr_upsampled['H4ED2'].astype(int)
# print(df_depr_upsampled.dtypes)
X = df_depr_upsampled.drop(columns = ['H4ED2'])
y = df_depr_upsampled['H4ED2']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
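The three classifiers imported earlier can then be fit and compared on the split. A sketch of that evaluation step on synthetic stand-in data (the sample sizes, hyperparameters, and metrics here are illustrative defaults, not the study's tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the balanced (resampled) study data
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.5, 0.5], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "logistic regression": LogisticRegression(max_iter=1000),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    print(f"{name}: accuracy={results[name][0]:.3f}, f1={results[name][1]:.3f}")
```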
from sklearn.feature_selection import mutual_info_classif
import numpy as np
# Calculate the mutual information scores
mi_scores = mutual_info_classif(X, y)
# Create a dictionary of feature importance scores
feature_scores = {}
for i, feature in enumerate(X.columns):
feature_scores[feature] = mi_scores[i]
# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)
# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
# ________________________________mutual_info_classif____________________________________________________________________
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis() # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")
# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
from sklearn.feature_selection import mutual_info_regression
# Apply Information Gain
ig = mutual_info_regression(X, y)
# Create a dictionary of feature importance scores
feature_scores = {}
for i in range(len(X.columns)):
    feature_scores[X.columns[i]] = ig[i]
# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)
# Print the feature importance scores and the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
# ________________________________________mutual_info_regression___________________________________________________________
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis() # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")
# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression # Replace with your desired model
# Assuming you have X and y data loaded
# Create an instance of the model you want to use for feature selection
model = LinearRegression() # Replace with your desired model
# Create the RFECV object with your model and scoring metric
rfecv = RFECV(estimator=model, scoring='neg_mean_squared_error', cv=10)
# Fit the RFECV to your data
rfecv.fit(X, y)
# Create a dictionary of feature importance scores
feature_scores = {}
for i, feature in enumerate(X.columns):
    feature_scores[feature] = rfecv.support_[i]
# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)
# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Selected:', score)
df_depr_upsampled.columns
X = df_depr_upsampled.drop(columns = ['H4ED2','BMI_08','H1GH51'])
y = df_depr_upsampled['H4ED2']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=20)
from sklearn.feature_selection import mutual_info_classif
import numpy as np
# Assuming you have X and y data loaded
# Calculate the mutual information scores
mi_scores = mutual_info_classif(X, y)
# Create a dictionary of feature importance scores
feature_scores = {}
for i, feature in enumerate(X.columns):
    feature_scores[feature] = mi_scores[i]
# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)
# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
# ________________________________mutual_info_classif____________________________________________________________________
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis() # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")
# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
from sklearn.feature_selection import mutual_info_regression
# Apply Information Gain
ig = mutual_info_regression(X, y)
# Create a dictionary of feature importance scores
feature_scores = {}
for i in range(len(X.columns)):
    feature_scores[X.columns[i]] = ig[i]
# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)
# Print the feature importance scores and the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
# ________________________________________mutual_info_regression___________________________________________________________
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis() # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")
# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression # Replace with your desired model
# Assuming you have X and y data loaded
# Create an instance of the model you want to use for feature selection
model = LinearRegression() # Replace with your desired model
# Create the RFECV object with your model and scoring metric
rfecv = RFECV(estimator=model, scoring='neg_mean_squared_error', cv=10)
# Fit the RFECV to your data
rfecv.fit(X, y)
# Create a dictionary of feature importance scores
feature_scores = {}
for i, feature in enumerate(X.columns):
    feature_scores[feature] = rfecv.support_[i]
# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)
# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Selected:', score)
The features with the lowest scores were removed.
## Building 3 supervised models
# dt = DecisionTreeClassifier()
# dt.fit(X, y)
# rf = RandomForestClassifier()
# rf.fit(X, y)
# lr = LogisticRegression()
# lr.fit(X, y)
# importance scores of features using decision tree
# importance = dt.feature_importances_
# for i, v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i, v))
# importance scores of features using random forest
# importance = rf.feature_importances_
# for i, v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i, v))
# importance scores of features using logistic regression
# importance = lr.coef_[0]
# for i, v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i, v))
Build an ensemble model using the following four models:
# Model 1 : KNeighbors
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
params_knn = {'n_neighbors': np.arange(1, 25)}
knn_gs = GridSearchCV(knn, params_knn, cv=10)
knn_gs.fit(X_train, y_train)
knn_best = knn_gs.best_estimator_
print(knn_gs.best_params_)
# import metrics class
from sklearn import metrics
y_pred_proba = knn_gs.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
y_pred = knn_gs.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
# Model 2 : RandomForest
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
params_rf = {'n_estimators': [50, 100, 200]}
rf_gs = GridSearchCV(rf, params_rf, cv=10)
rf_gs.fit(X_train, y_train)
# Store the second model's results
rf_best = rf_gs.best_estimator_
print(rf_gs.best_params_)
y_pred_proba = rf_gs.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
y_pred = rf_gs.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
# Model 3: Logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
y_pred = logreg.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
# Model 4: Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
model_gb = GaussianNB()
model_gb.fit(X_train, y_train)
y_pred_proba = model_gb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
y_pred = model_gb.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(logreg.score(X_test, y_test)))
print('gnb: {}'.format(model_gb.score(X_test, y_test)))
Ensemble model combining all four models
# ensemble all models
from sklearn.ensemble import VotingClassifier
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', logreg), ('gnb',model_gb)]
ensemble = VotingClassifier(estimators, voting='soft')  # 'soft' averages predicted probabilities; 'hard' would use majority vote
ensemble.fit(X_train, y_train)
y_pred_proba = ensemble.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,5)))
plt.legend(loc=4)
plt.show()
y_pred = ensemble.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
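Note that `precision_score` computes precision (TP / (TP + FP)), not specificity (TN / (TN + FP)). True specificity can be derived from the confusion matrix; a minimal sketch on toy labels:

```python
from sklearn.metrics import confusion_matrix

def specificity_score(y_true, y_pred):
    # for binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tn / (tn + fp)  # true-negative rate

# toy check: one of two negatives predicted correctly -> specificity 0.5
print(specificity_score([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.5
```

In the notebook this would be called as `specificity_score(y_test, y_pred)` alongside the recall and precision printouts.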
import matplotlib.pyplot as plt
# Assuming you have the arrays: y_pred, y_test, y_train, y_pred_train, all containing 0s and 1s
y_pred_train = ensemble.predict(X_train)
# Count the frequency of 0s and 1s in each array
unique_pred, counts_pred = np.unique(y_pred, return_counts=True)
unique_test, counts_test = np.unique(y_test, return_counts=True)
unique_train, counts_train = np.unique(y_train, return_counts=True)
unique_pred_train, counts_pred_train = np.unique(y_pred_train, return_counts=True)
# Create a grouped bar plot
width = 0.2 # Width of the bars
x = np.arange(len(unique_pred))
fig, ax = plt.subplots()
rects1 = ax.bar(x - 1.5*width, counts_pred, width, label='y_pred')
rects2 = ax.bar(x - 0.5*width, counts_test, width, label='y_test')
rects3 = ax.bar(x + 0.5*width, counts_train, width, label='y_train')
rects4 = ax.bar(x + 1.5*width, counts_pred_train, width, label='y_pred_train')
ax.set_xlabel('Predicted Value')
ax.set_ylabel('Frequency')
ax.set_title('Frequency of 0s and 1s in y_pred, y_test, y_train, and y_pred_train')
ax.set_xticks(x)
ax.set_xticklabels(unique_pred)
ax.legend()
plt.show()
Note: the hyperparameter-tuning code below takes a long time to execute, so it is left commented out.
Hyperparameter tuning is an essential step in optimizing the performance of a classification model.
# from sklearn.ensemble import RandomForestClassifier
# # Create the model
# model = RandomForestClassifier()
# # Define the Hyperparameter Grid:
# # Create a dictionary of hyperparameters and their corresponding values that you want to tune.
# # Choose a range of values to search for each hyperparameter.
# param_grid = {
# 'n_estimators': [100, 200, 300], # Number of trees in the forest
# 'max_depth': [None, 5, 10, 20], # Maximum depth of the tree
# 'min_samples_split': [2, 5, 10], # Minimum number of samples required to split an internal node
# 'min_samples_leaf': [1, 2, 4] # Minimum number of samples required to be at a leaf node
# }
# Perform Grid Search or Randomized Search:
# from sklearn.model_selection import GridSearchCV
# # Create the GridSearchCV object
# grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=5)
# # Fit the GridSearchCV to the training data
# grid_search.fit(X_train, y_train)
# # Get the best hyperparameters
# best_params = grid_search.best_params_
# print("Best Hyperparameters:", best_params)
# # Train the model with the best hyperparameters
# best_model = RandomForestClassifier(**best_params)
# best_model.fit(X_train, y_train)
# # Predict on the test set
# y_pred = best_model.predict(X_test)
# # Evaluate the model
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# accuracy = accuracy_score(y_test, y_pred)
# conf_matrix = confusion_matrix(y_test, y_pred)
# classification_rep = classification_report(y_test, y_pred)
# print("Accuracy:", accuracy)
# print("Confusion Matrix:\n", conf_matrix)
# print("Classification Report:\n", classification_rep)
17 features were shortlisted from the dataset to investigate whether participants would go on to study at the "graduate level or more" or stop at the college level.
The features BMI_08 and BMI_94 were calculated from participants' height and weight. deltaBMI is the difference between BMI_08 and BMI_94, indicating each participant's overall BMI/health trend.
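For reference, BMI can be computed from height and weight; a minimal sketch using the imperial formula (the column names here are illustrative assumptions, not the actual Add Health variable names):

```python
import pandas as pd

# hypothetical columns: height_in (inches), weight_lb (pounds)
df = pd.DataFrame({"height_in": [65, 70], "weight_lb": [130, 180]})

# imperial BMI formula: 703 * weight(lb) / height(in)^2
df["BMI"] = 703 * df["weight_lb"] / df["height_in"] ** 2
print(df["BMI"].round(1).tolist())  # [21.6, 25.8]
```

Computing BMI at both waves the same way makes the deltaBMI difference directly comparable across participants.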
The interquartile range (IQR) method was used to remove outliers from numerical features: the Q1 and Q3 quartiles were calculated to estimate the IQR. Because features had non-uniform units (kg, lbs, hrs, yes/no, etc.), the dataset was scaled to remove the undue influence of feature units.
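The IQR filtering described above can be sketched as follows (a generic implementation with a toy column, not the exact code used on this dataset):

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    # keep rows within [Q1 - k*IQR, Q3 + k*IQR]
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# toy example: 100 is an outlier relative to the rest and is dropped
df = pd.DataFrame({"x": [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]})
print(remove_outliers_iqr(df, "x")["x"].tolist())
```

The multiplier `k=1.5` is the conventional Tukey fence; a larger value keeps more borderline points.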
According to the feature-scoring algorithms, the top three features by mutual_info_classif were H1ED13 (score 0.057), BMI_08 (0.037), and H1ED14 (0.036); by mutual_info_regression they were H1ED14 (0.053), H4GH1 (0.042), and BMI_08 (0.037). In addition, Recursive Feature Elimination with Cross-Validation (RFECV) was tried during feature engineering. Based on the feature scores, two features were removed and the remaining 15 features were used to train the classification models.
An ensemble model (sensitivity 0.71, precision 0.68, AUC 0.77) was built from the following four models: KNeighbors (score 0.668), RandomForest (0.7133), LogisticRegression (0.678), and Gaussian Naive Bayes (0.68).
Hyperparameter tuning is an essential step in optimizing the performance of a classification model. Because it is time-intensive, it was not implemented in this report due to time constraints; in future work, hyperparameter tuning could be applied to improve model performance.
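As a cheaper alternative to the exhaustive grid search sketched above, `RandomizedSearchCV` evaluates only a fixed number of sampled parameter settings; a minimal sketch on synthetic data (the parameter ranges are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for the Add Health features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# n_iter=5 tries only 5 of the 27 combinations, so it is much faster than GridSearchCV
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

On the real data the same call would take `X_train, y_train` in place of the synthetic `X, y`.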
Harris, Kathleen Mullan; Udry, Richard J., 2015, "National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave I, 1994-1995", https://doi.org/10.15139/S3/11900, UNC Dataverse, V3 https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11900
Harris, Kathleen Mullan; Udry, Richard J., 2015, "National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave IV, 2008", https://doi.org/10.15139/S3/11920, UNC Dataverse, V3. https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11920
Lecy, N., Osteen, P. The Effects of Childhood Trauma on College Completion. Res High Educ 63, 1058–1072 (2022). https://doi.org/10.1007/s11162-022-09677-9
The National Longitudinal Study of Adolescent to Adult Health (Add Health) https://addhealth.cpc.unc.edu/
ODUM INSTITUTE DATA ARCHIVE https://odum.unc.edu/archive/
Heatmaps in Python : https://plotly.com/python/heatmaps/
Setting the Font, Title, Legend Entries, and Axis Titles in Python: https://plotly.com/python/figure-labels/
EDA: https://github.com/SaurabhPrabhu94/ANLY-530-Group-Project-Heart/tree/main
Brown, D. W., Anda, R. F., Tiemeier, H., Felitti, V. J., Edwards, V. J., Croft, J. B., & Giles, W. H. (2009). Adverse childhood experiences and the risk of premature mortality. American Journal of Preventive Medicine, 37, 389–396. https://doi.org/10.1016/j.amepre.2009.06.021
01 Data Preprocessing I - Intro to Jupyter, Pandas Dataframe objects, and Handling Missing
02 Data Preprocessing II - DF Mechanics - Joining, Merging, and Concatenation
03 Visual Data Exploration
04 Frequent Pattern Mining and Unsupervised Machine Learning
05 Frequent Pattern Mining II
06 Generalized Linear Models, Supervised Machine Learning Intro and Classification